{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Lab 4 - Histograms and green taxi trips\n", "\n", "A histogram is similar to a bar chart, but for quantitative data. The possible data values are divided into equal intervals (*bins*), and the number of data values in each bin is counted. A histogram draws these counts as bars. In this way it visualizes how frequently different data values occur, and we call this information a *distribution*.\n", "\n", "We will use the green taxi trip dataset from Lab 3. \n", "\n", "As usual, we have to import the `matplotlib` and `pandas` packages, and change a setting so that plots appear directly in the Jupyter notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we need to load the data from the CSV file into a dataframe called `taxi`. The two columns `lpep_pickup_datetime` and `lpep_dropoff_datetime` should be parsed as dates. See if you remember the code to do this, and if not, look at Lab 3." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check that the `taxi` dataframe now exists by displaying it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A histogram visualizes the data from a single column in the dataframe. We will first look at the `trip_distance` column, which stores the distance (in miles) recorded by the taximeter for the trip.\n", "\n", "Can you write the code to display just this column? Try it below. 
(We did this in Labs 2 and 3.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Pattern:\n", " dataframe_name[\"column_name\"]\n", "
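As a minimal sketch of this pattern, the code below builds a small made-up dataframe called `demo` (a hypothetical stand-in, not the real taxi data) and selects one column from it. Single quotes around the column name work the same way as double quotes.

```python
import pandas as pd

# Hypothetical stand-in for the taxi data; these values are made up.
demo = pd.DataFrame({
    'trip_distance': [0.8, 2.5, 1.1, 7.4],
    'passenger_count': [1, 2, 1, 5],
})

# Square brackets with a column name return just that column (a pandas Series).
distances = demo['trip_distance']
print(distances)
```

The same pattern applies to `taxi` once it has been loaded from the CSV file.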
\n", "\n", "To make a histogram of the data in the `trip_distance` column, type and run the code `taxi[\"trip_distance\"].hist()`. The `.` between `taxi[\"trip_distance\"]` and `hist()` means we are applying the histogram function `hist()` to `taxi[\"trip_distance\"]`, which is the column `trip_distance` in the dataframe `taxi`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What do you notice about the histogram? Which trips are most frequent: shorter or longer trips? Does this make sense?\n", "\n", "Let's label the x and y axes, and add a title. Try running the following code:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "taxi[\"trip_distance\"].hist()\n", "plt.xlabel(\"Trip distance (miles)\")\n", "plt.ylabel(\"# of trips\")\n", "plt.title(\"Green Taxi Trip Distance Distribution\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can you figure out which line does what? You can think of `plt` as a built-in variable that refers to the latest graph made. Technically, `plt` is the `matplotlib.pyplot` module we imported at the top, but the details are not important right now." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also change the number of bins with the `bins` parameter. Try the code `taxi[\"trip_distance\"].hist(bins = 20)`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How did the plot change? What happens if you change the number of bins to 40?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Has the shape of the graph changed? What could explain this?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make a histogram to visualize the distribution of data in the `passenger_count` column. Can you figure out the code for this?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", " taxi[\"passenger_count\"].hist()\n", "
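The bin counting described at the start of this lab can be sketched with `numpy.histogram`, which returns the counts and the bin boundaries without drawing anything. The list `passengers` and the choice of 4 bins are assumptions made up for illustration; `hist()` uses 10 bins unless you tell it otherwise.

```python
import numpy as np

# Hypothetical passenger counts; these values are made up for illustration.
passengers = [1, 1, 2, 1, 5, 2, 1, 3]

# Split the range from the smallest to the largest value into 4 equal-width
# bins, then count how many values fall into each bin.
counts, edges = np.histogram(passengers, bins=4)

print(edges)   # boundaries of the bins
print(counts)  # number of values in each bin
```

Try changing `bins=4` to `bins=10` and compare the boundaries with the values in the list.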
\n", "\n", "What do you notice? Is this what you would expect?\n", "\n", "Let's compare the histogram of `passenger_count` with the bar chart of `passenger_count`. Write the code to make the bar chart below. Look at Lab 3 for a reminder of how to do this." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", "counts = taxi[\"passenger_count\"].value_counts()\n", "counts.plot(kind = \"bar\")\n", "
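Here is a sketch of what `value_counts()` produces, using a short made-up Series named `passengers` in place of the real column. The `sort_index()` step is an optional addition and not part of the answer above; it orders the result by passenger count, which is handy if you go on to call `.plot(kind = 'bar')`.

```python
import pandas as pd

# Hypothetical passenger counts; these values are made up for illustration.
passengers = pd.Series([1, 1, 2, 1, 5, 2, 1, 3])

# value_counts() tallies how many times each distinct value appears.
counts = passengers.value_counts()

# sort_index() orders the tally by passenger count instead of by frequency.
ordered = counts.sort_index()
print(ordered)
```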
\n", "\n", "How are the histogram and the bar chart similar? How are they different? Which one do you prefer?\n", "\n", "#### Challenges\n", "- Plot a histogram of the `total_amount` column, which is the total amount (including tolls, tip, etc.) paid for the trip. \n", "- Add axis labels and a title to the passenger count histogram. Can you figure out how to add these to the bar chart?\n", "- Does the distribution (histogram) of trip distances at night look different from the distribution of trip distances during the day? To test this, download two new datasets from NYC Open Data, one filtered to trips from a single night and the other filtered to trips from a single day." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }